Abstract. We consider the specification of prior distributions for Bayesian
model comparison, focusing on regression-type models. We propose
a particular joint specification of the prior distribution across models so 
that sensitivity of posterior model probabilities to the dispersion of prior 
distributions for the parameters of individual models (Lindley's paradox) 
is diminished. We illustrate the behavior of inferential and predictive pos- 
terior quantities in linear and log-linear regressions under our proposed 
prior densities with a series of simulated and real data examples. 

1. INTRODUCTION AND MOTIVATION

A Bayesian approach to inference under model 
uncertainty proceeds as follows. Suppose that the 
data y are considered to have been generated by 
a model m, one of a set M of competing models. 
Each model specifies the distribution of Y, $f(y \mid m, \beta_m)$, apart from an
unknown parameter vector $\beta_m \in B_m$, where $B_m$ is the set of all possible
values for the coefficients of model m. We assume that $B_m = \mathbb{R}^{d_m}$,
where $d_m$ is the dimensionality of $\beta_m$.

If $f(m)$ is the prior probability of model m, then
the posterior probability is given by

$$f(m \mid y) = \frac{f(m) f(y \mid m)}{\sum_{m \in M} f(m) f(y \mid m)}, \qquad m \in M, \tag{1}$$

where $f(y \mid m)$ is the marginal likelihood calculated
using $f(y \mid m) = \int f(y \mid m, \beta_m) f(\beta_m \mid m)\, d\beta_m$ and
$f(\beta_m \mid m)$ is the conditional prior distribution of $\beta_m$,
the model parameters for model m. Therefore

$$f(m \mid y) \propto f(m) f(y \mid m), \qquad m \in M.$$
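
As a minimal sketch of (1), the normalization can be carried out on the log scale, which avoids underflow when marginal likelihoods are very small. The function name `posterior_model_probs` and the numerical inputs are illustrative assumptions, not quantities taken from the paper.

```python
import math

def posterior_model_probs(log_prior, log_marglik):
    """Return f(m|y) for each model from log f(m) and log f(y|m), as in (1)."""
    # Unnormalized log posterior: log f(m) + log f(y|m).
    log_post = [lp + lm for lp, lm in zip(log_prior, log_marglik)]
    # Subtract the maximum before exponentiating (log-sum-exp device).
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: three models, uniform prior, made-up log marginal likelihoods.
probs = posterior_model_probs(
    [math.log(1 / 3)] * 3,
    [-1050.2, -1048.7, -1049.9],
)
print(probs)
```

Under a uniform prior over models, the ratio of any two entries of `probs` equals the corresponding ratio of marginal likelihoods.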

For any two models $m_1$ and $m_2$, the ratio of the
posterior model probabilities (posterior odds in favor
of $m_1$) is given by

$$\frac{f(m_1 \mid y)}{f(m_2 \mid y)} = \frac{f(m_1)}{f(m_2)} \cdot \frac{f(y \mid m_1)}{f(y \mid m_2)}, \tag{2}$$

the ratio of prior probabilities multiplied by the ratio
of marginal likelihoods, also known as the Bayes
factor.

The posterior distribution for the parameters of
a particular model is given by the familiar expression

$$f(\beta_m \mid m, y) \propto f(\beta_m \mid m) f(y \mid \beta_m, m), \qquad m \in M.$$

For a single model, a highly diffuse prior on the 
model parameters is often used (perhaps to repre- 
sent ignorance). Then the posterior density takes 
the shape of the likelihood and is insensitive to the 
exact value of the prior density function, provided 
that the prior is relatively flat over the range of pa- 
rameter values with nonnegligible likelihood. When 
multiple models are being considered, however, the 
use of such a prior may create an apparent diffi- 
culty. The most obvious manifestation of this occurs 
when we are considering two models $m_1$ and $m_2$,
where $m_1$ is completely specified (no unknown parameters)
and $m_2$ has parameter $\beta_{m_2}$ and associated
prior density $f(\beta_{m_2} \mid m_2)$. Then, for any observed
data y, the Bayes factor in favor of $m_1$ can be made
arbitrarily large by choosing a sufficiently diffuse
prior distribution for $\beta_{m_2}$ (corresponding to a prior
density $f(\beta_{m_2} \mid m_2)$ which is sufficiently small over
the range of values of $\beta_{m_2}$ with nonnegligible likelihood).
Hence, under model uncertainty, two different
diffuse prior distributions for model parameters
might lead to essentially the same posterior distribu- 
tions for those parameters, but very different Bayes 
factors. 
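
The effect can be seen in a small numerical sketch. The setting is an assumption made for illustration only: the sample mean of n observations from N(theta, 1), with $m_1$ fixing theta = 0 and $m_2$ placing a N(0, c^2) prior on theta. The function names and data values are hypothetical.

```python
import math

def normal_logpdf(x, var):
    """Log density of N(0, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * x * x / var

def log_bayes_factor_m1(ybar, n, c):
    """log Bayes factor in favor of m1 (theta = 0) against m2 (theta ~ N(0, c^2))."""
    # Under m1 the sample mean is N(0, 1/n); under m2 it is N(0, c^2 + 1/n).
    return normal_logpdf(ybar, 1.0 / n) - normal_logpdf(ybar, c * c + 1.0 / n)

def posterior_sd_m2(n, c):
    """Posterior standard deviation of theta under m2."""
    return math.sqrt(1.0 / (n + 1.0 / (c * c)))

ybar, n = 0.3, 100   # hypothetical data summary: ybar lies 3 standard errors from 0
for c in (10.0, 1e3, 1e5):
    print(c, posterior_sd_m2(n, c), math.exp(log_bayes_factor_m1(ybar, n, c)))
```

Across these values of c the posterior standard deviation under $m_2$ is essentially unchanged, while the Bayes factor in favor of $m_1$ grows roughly in proportion to c.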

This result was discussed by Lindley (1957) and 
is often referred to as "Lindley's paradox" although 
it is also variously attributed to Bartlett (1957) and 
Jeffreys (1961). As Dawid (2011) pointed out, the 
Bayes factor is only one of the two elements on the 
right side of (2) which contribute toward the pos- 
terior model probabilities. The prior model proba- 
bilities are of equal significance. By focusing on the 
impact of the prior distributions for model param- 
eters on the Bayes factor, there is an implicit un- 
derstanding that the prior model probabilities are 
specified independently of these prior distributions. 
This is often the case in practice, where a uniform 
prior distribution over models is commonly adopted, 
as a reference position. Examples where nonuniform 
prior distributions have been suggested include the 
works of Madigan et al. (1995), Chipman (1996), 
Laud and Ibrahim (1995, 1996), Chipman, George 
and McCulloch (2001), Cui and George (2008), Ley 
and Steel (2009) and Wilson et al. (2010). We pro- 
pose a different approach where we consider how the 
two elements of the prior distribution under model 
uncertainty might be jointly specified so that per- 
ceived problems with Bayesian model comparison 
can be avoided. This leads to a nonuniform spec- 
ification for the prior distribution over models, de- 
pending directly on the prior distributions for model 
parameters. 

A related issue concerns the use of improper prior 
distributions for model parameters. Such prior dis- 
tributions involve unspecified constants of propor- 
tionality, which do not appear in posterior distri- 
butions for model parameters but do appear in the 
marginal likelihood for any model and in any as- 
sociated Bayes factors, so these quantities are not 
uniquely determined. There have been several at- 
tempts to address this issue, and to define an ap- 
propriate Bayes factor for comparing models with 
improper priors; see Kadane and Lazar (2004) for 
a review. In such examples, Dawid (2011) proposed
that the product of the prior model "probability"
and the prior density for a given model could be 
determined simultaneously by eliciting the relative 
prior "probabilities" of particular sets of parame- 
ter values for different models. He also suggested 
an approach for constructing a general noninfor- 
mative prior, over both models and model param- 
eters, based on Jeffreys priors for individual mod- 
els. Although the prior distributions for individual 
models are not generally proper, they have densi- 
ties which are uniquely determined and hence the 
posterior distribution over models can be evaluated. 
Clyde (2000) proposed a similar approach where the 
priors for parameters of individual models are uni- 
form and the relative weights of different models are 
chosen by constraining the resulting posterior model 
probabilities to be equivalent to those resulting from 
a specified information criterion, such as BIC.

Here, we do not consider improper prior distribu- 
tions for the model parameters, but our approach is 
similar in spirit as we do explicitly consider a joint 
specification of the prior over models and model pa- 
rameters. 

We focus on models in which the parameters are
sufficiently homogeneous (perhaps under transformation)
so that a multivariate normal prior density
$N(\mu_m, V_m)$ is appropriate, and in which the likelihood
is sufficiently regular for standard asymptotic
results to apply. Examples are linear regression
models, generalized linear models and standard
time series models. In much of what follows, with minor
modification, the normal prior can be replaced
by any elliptically symmetric prior density proportional
to $|V|^{-1/2} g((\beta - \mu)^T V^{-1} (\beta - \mu))$, where
$\int_0^\infty r^{d-1} g(r^2)\, dr < \infty$ and d is the dimensionality
of $\beta$. This includes prior distributions from the multivariate
t or Laplace families. Similarly, our approach
can also be adapted to common prior distributions
for parameters of graphical models.

We choose to decompose the prior variance matrix
as $V_m = c_m^2 \Sigma_m$, where $c_m$ represents the scale of the
prior dispersion and $\Sigma_m$ is a matrix with a specified
value of $|\Sigma_m|$, although for the remainder of
this section we do not require an explicit value; further
discussion of this issue is presented in Section 2.
Hence, suppose that

$$f(\beta_m \mid m) = (2\pi)^{-d_m/2} |\Sigma_m|^{-1/2} c_m^{-d_m} \exp\left\{ -\frac{1}{2 c_m^2} (\beta_m - \mu_m)^T \Sigma_m^{-1} (\beta_m - \mu_m) \right\}. \tag{3}$$
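Then,

$$\begin{aligned}
f(m \mid y) &\propto f(m) \int f(y \mid m, \beta_m) f(\beta_m \mid m)\, d\beta_m \\
&= f(m) (2\pi)^{-d_m/2} |\Sigma_m|^{-1/2} c_m^{-d_m} \int \exp\left\{ -\frac{1}{2 c_m^2} (\beta_m - \mu_m)^T \Sigma_m^{-1} (\beta_m - \mu_m) \right\} f(y \mid m, \beta_m)\, d\beta_m
\end{aligned} \tag{4}$$

and for suitably large $c_m$,

$$f(m \mid y) \approx f(m) (2\pi)^{-d_m/2} |\Sigma_m|^{-1/2} c_m^{-d_m} \int f(y \mid m, \beta_m)\, d\beta_m. \tag{5}$$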

Hence, as $c_m$ gets larger, $f(m \mid y)$ gets smaller, assuming
everything else remains fixed. Therefore, for
two models of different dimension with the same
value of $c_m^2$, the posterior odds in favor of the more
complex model tend to zero as $c_m^2$ gets larger, that
is, as the prior dispersion increases at a common
rate. This is essentially Lindley's paradox.
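
To make the rate explicit, the following worked step divides the large-dispersion approximation (5) for two models $m_1$ and $m_2$ that share a common scale c. The common-scale assumption and the labelling of the two models are for illustration only.

```latex
\frac{f(m_1 \mid y)}{f(m_2 \mid y)} \approx
\frac{f(m_1)}{f(m_2)}\,
(2\pi)^{-(d_{m_1} - d_{m_2})/2}
\left( \frac{|\Sigma_{m_1}|}{|\Sigma_{m_2}|} \right)^{-1/2}
c^{-(d_{m_1} - d_{m_2})}\,
\frac{\int f(y \mid m_1, \beta_{m_1})\, d\beta_{m_1}}
     {\int f(y \mid m_2, \beta_{m_2})\, d\beta_{m_2}}
```

Every factor except $c^{-(d_{m_1} - d_{m_2})}$ is free of c, so when $d_{m_1} > d_{m_2}$ and the prior model probabilities are held fixed, the odds in favor of the larger model decay like $c^{-(d_{m_1} - d_{m_2})}$.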

There have been substantial recent computational 
advances in methodology for exploring the model 
space; see, for example, Green (1995, 2003), Kohn, 
Smith and Chan (2001), Denison et al. (2002), Hans, 
Dobra and West (2007). The related discussion of 
the important problem of choosing prior parame- 
ter dispersions has been largely focused on ways to 
avoid Lindley's paradox; see, for example, Fernandez, 
Ley and Steel (2001) and Liang et al. (2008) for de- 
tailed discussion on appropriate choices of g-priors 
for linear regression models and Raftery (1996) and 
Dellaportas and Forster (1999) for some guidelines 
on selecting dispersion parameters of normal priors 
for generalized linear model parameters. Other ap- 
proaches which have been proposed for specifying 
default prior distributions under model uncertainty 
which provide plausible posterior model probabil- 
ities include intrinsic priors (Berger and Pericchi, 
1996) and, for normal linear models, mixtures of g- 
priors (Liang et al., 2008). The important effect that 
any of these prior specifications might have on the 
parameter posterior distributions within each model 
has been largely neglected. For example, a set of
values of $c_m$ might be appropriate for addressing
model uncertainty, but might produce prior densities
$f(\beta_m \mid m)$ that are insufficiently diffuse and
overstate prior information within certain models.
This has a serious effect on posterior and predictive
densities of all quantities of interest in any data
analysis. This is a particularly important consideration
when posterior or predictive inferences are integrated
over models (model averaging). In such analyses
both the prior model probabilities and prior
distributions over model parameters can have a significant
impact on inferences.

In this paper we propose that prior distributions 
for model parameters should be specified with the 
issue of inference conditional on a particular model 
being the primary focus. For example, when only 
weak information concerning the model parameters 
is available, a highly diffuse prior may be deemed 
appropriate. The key element of our proposed ap- 
proach is that sensitivity of posterior model prob- 
abilities to the exact scale of such a diffuse prior 
is avoided by suitable specification of prior model 
probabilities $f(m)$. As mentioned above, these probabilities
are rarely specified carefully, a discrete uniform
prior distribution across models usually being
adopted. However, it is straightforward to see that
setting $f(m) \propto c_m^{d_m}$ in (5) will have the effect of eliminating
dependence of the posterior model probability
$f(m \mid y)$ on the prior dispersion $c_m$. This provides
a motivation for investigating how prior model prob- 
abilities can be chosen in conjunction with prior dis- 
tributions for model parameters, by first considering 
properties of the resulting posterior distribution. 
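
A small numerical sketch of this cancellation follows. The setting is an assumed toy problem, not one of the paper's examples: the sample mean of n observations from N(theta, 1), with $m_1$ fixing theta = 0 (dimension 0) and $m_2$ using a N(0, c^2) prior (dimension 1). All names and values are hypothetical.

```python
import math

def log_marglik_m1(ybar, n):
    """log f(ybar | m1): theta fixed at 0, so ybar ~ N(0, 1/n)."""
    var = 1.0 / n
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * ybar ** 2 / var

def log_marglik_m2(ybar, n, c):
    """log f(ybar | m2): theta ~ N(0, c^2), so ybar ~ N(0, c^2 + 1/n)."""
    var = c * c + 1.0 / n
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * ybar ** 2 / var

def posterior_odds_m2(ybar, n, c, adjust_prior):
    """Posterior odds of m2 against m1.

    adjust_prior=False: uniform prior over the two models.
    adjust_prior=True:  f(m) proportional to c^{d_m}, with d_m1 = 0 and d_m2 = 1.
    """
    log_prior_odds = math.log(c) if adjust_prior else 0.0
    return math.exp(log_prior_odds + log_marglik_m2(ybar, n, c) - log_marglik_m1(ybar, n))

ybar, n = 0.3, 100   # hypothetical data summary
limit = math.exp(0.5 * n * ybar ** 2) / math.sqrt(n)   # large-c limit of the adjusted odds
for c in (10.0, 1e3, 1e5):
    print(c, posterior_odds_m2(ybar, n, c, False), posterior_odds_m2(ybar, n, c, True))
```

With the uniform prior the odds for $m_2$ shrink in proportion to 1/c, whereas with the adjusted prior they settle at `limit`, which does not involve c.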

The strategy described in this paper can be viewed 
as a full Bayesian approach where the prior distribu- 
tion for model parameters is specified by focusing on 
the uncertainty concerning those parameters alone, 
and the prior model probabilities can be specified 
by considering the way in which an associated "in- 
formation criterion" balances parsimony and good- 
ness of fit. In the past, informative specifications for 
these probabilities have largely been elicited via the 
notion of imaginary data; see, for example, Chen, 
Ibrahim and Yiannoutsos (1999) and Chen et al. (2003).
Within the approach suggested here, prior model 
probabilities are specified by considering the way in 
which data yet to be observed might modify one's 
beliefs about models, given the prior distributions 
for the model parameters. Full posterior inference 
under model uncertainty, including model averag- 
ing, is then available for the chosen prior. 

2. PRIOR AND POSTERIOR DISTRIBUTIONS 

We consider the joint specification of the two components
of the prior distribution by investigating its
impact on the asymptotic posterior model probabilities.
This allows us to investigate, across a wide
class of models, the sensitivity of posterior inferences
to the specification of prior model probabilities
and prior distributions for model parameters.
By using Laplace's method to approximate the posterior
marginal likelihood in (4), we obtain, subject
to certain regularity conditions (see Kass, Tierney
and Kadane, 1988; Schervish, 1995, Section 7.4.3),

$$f(m \mid y) \propto f(m) |\Sigma_m|^{-1/2} c_m^{-d_m} f(y \mid m, \hat\beta_m) \exp\left\{ -\frac{1}{2 c_m^2} (\hat\beta_m - \mu_m)^T \Sigma_m^{-1} (\hat\beta_m - \mu_m) \right\} \left| c_m^{-2} \Sigma_m^{-1} - H(\hat\beta_m) \right|^{-1/2} \left( 1 + O_p(n^{-1}) \right), \tag{6}$$

where n is the sample size, $\hat\beta_m$ is the maximum likelihood
estimate and $H(\beta_m)$ is the second derivative
matrix for $\log f(y \mid m, \beta_m)$. Then,

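$$\begin{aligned}
\log f(m \mid y) &= C + \log f(m) - \tfrac{1}{2} \log|\Sigma_m| - d_m \log c_m + \log f(y \mid m, \hat\beta_m) \\
&\quad - \frac{1}{2 c_m^2} (\hat\beta_m - \mu_m)^T \Sigma_m^{-1} (\hat\beta_m - \mu_m) - \tfrac{1}{2} \log\left| c_m^{-2} \Sigma_m^{-1} - H(\hat\beta_m) \right| + O_p(n^{-1}) \\
&= C + \log f(m) - \tfrac{1}{2} \log|\Sigma_m| - d_m \log c_m + \log f(y \mid m, \hat\beta_m) \\
&\quad - \frac{1}{2 c_m^2} (\hat\beta_m - \mu_m)^T \Sigma_m^{-1} (\hat\beta_m - \mu_m) - \frac{d_m}{2} \log n - \tfrac{1}{2} \log|i(\hat\beta_m)| + O_p(n^{-1/2}),
\end{aligned} \tag{7}$$

where C is a normalizing constant to ensure that the
posterior model probabilities sum to 1 and $i(\beta_m) \equiv
-n^{-1} H(\beta_m)$ is the Fisher information matrix for
a unit observation; see Kass and Wasserman (1995).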
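
The accuracy of an approximation of the form (6) can be checked in a scalar case. The sketch below assumes a Poisson model with log rate beta and a N(mu, c^2) prior, so that $\Sigma_m = 1$ and $-H(\hat\beta) = n e^{\hat\beta}$; the model, the function names and the data values are illustrative assumptions. A trapezoidal rule provides a reference value for the integral.

```python
import math

def log_kernel(beta, s, n, mu, c):
    """log f(y | beta) + log f(beta) for Poisson counts with log rate beta ~ N(mu, c^2).

    s is the sum of the n counts; the term -sum(log y_i!) is constant and omitted.
    """
    log_lik = s * beta - n * math.exp(beta)
    log_prior = -0.5 * math.log(2 * math.pi * c * c) - 0.5 * (beta - mu) ** 2 / (c * c)
    return log_lik + log_prior

def laplace_log_marglik(s, n, mu, c):
    """Scalar analogue of (6): expand about the maximum likelihood estimate."""
    beta_hat = math.log(s / n)                            # MLE of the log rate
    curvature = 1.0 / (c * c) + n * math.exp(beta_hat)    # c^{-2} - H(beta_hat), Sigma = 1
    return log_kernel(beta_hat, s, n, mu, c) + 0.5 * math.log(2 * math.pi / curvature)

def quadrature_log_marglik(s, n, mu, c, half_width=2.0, steps=20000):
    """Trapezoidal reference value for the same integral on beta_hat +/- half_width."""
    beta_hat = math.log(s / n)
    top = log_kernel(beta_hat, s, n, mu, c)               # rescale to avoid overflow
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        beta = beta_hat - half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(log_kernel(beta, s, n, mu, c) - top)
    return top + math.log(total * h)

s, n, mu, c = 60, 20, 0.0, 10.0   # hypothetical data: 20 counts summing to 60
print(laplace_log_marglik(s, n, mu, c), quadrature_log_marglik(s, n, mu, c))
```

In this example the two values differ only in the third decimal place of the log marginal likelihood, in line with a relative error that vanishes as the sample size grows.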

We propose specifying the decomposition of the
prior variance matrix $c_m^2 \Sigma_m$ so that $|\Sigma_m| = |i(\beta_m)|^{-1}$,
resulting in

$$\log f(m \mid y) = C + \log f(y \mid m, \hat\beta_m) - \frac{1}{2 c_m^2} (\hat\beta_m - \mu_m)^T \Sigma_m^{-1} (\hat\beta_m - \mu_m) + \log f(m) - d_m \log c_m - \frac{d_m}{2} \log n + O_p(n^{-1/2}), \tag{8}$$

where $c_m^{-2}$, defined as

$$c_m^{-2} = \left( |V_m|\, |i(\beta_m)| \right)^{-1/d_m}, \tag{9}$$

can be interpreted as the number of units of information
in the prior.

Note that substituting $c_m = 1$ (unit information)
into (8), and choosing a discrete uniform prior distribution
across models, suggests model comparison
on the basis of a modified version of the Schwarz criterion
(BIC; Schwarz, 1978) where maximum likelihood
is replaced by maximum penalized likelihood.
In a comparison of two nested models, Kass and
Wasserman (1995) gave extra conditions on a unit
information prior which lead to model comparison
asymptotically based on BIC; see Volinsky and Raftery
(2000) for an example of the use of unit information
priors for Bayesian model comparison. For
regression-type models where the components of y
are not identically distributed, depending on explanatory
data, the unit information as defined above
potentially changes as the sample size changes, so
a little care is required with asymptotic arguments.
We assume that the explanatory variables arise in
such a way that $i(\beta_m) = i_{\lim}(\beta_m) + O(n^{-1/2})$, where
$i_{\lim}(\beta_m)$ is a finite limit. This is not a great restriction
and is true, for example, where the explanatory
data may be thought of as i.i.d. observations from
a distribution with finite variance.

In general, $i(\beta_m)$ depends on the unknown model
parameters, so the number of units of information $c_m^{-2}$
corresponding to any given prior variance matrix $V_m$
will also not be known, and hence it is not generally
possible to construct an exact unit information
prior. Dellaportas and Forster (1999) and Ntzoufras,
Dellaportas and Forster (2003) advocated substituting
$\mu_m$, the prior mean of $\beta_m$, into $i(\beta_m)$ to give
a prior for model comparison which has a unit information
interpretation but for which model comparison
is not asymptotically based on BIC.

When the prior distribution for the parameters
of model m is highly diffuse, so that $c_m$ is large,
then (8) can be rewritten as

$$\log f(m \mid y) \approx C + \log f(y \mid m, \hat\beta_m) + \log f(m) - d_m \log c_m - \frac{d_m}{2} \log n, \tag{10}$$

where $\hat\beta_m$ is the maximum likelihood estimate of $\beta_m$.
Equation (10) corresponds asymptotically to an information
criterion with complexity penalty equal to
$\log n + \log c_m^2 - 2 d_m^{-1} \log f(m)$ compared with BIC,
for example, where the complexity penalty is equal
to $\log n$. The relative discrepancy between these two
penalties is asymptotically zero. Poskitt and Tremayne
(1983) discussed the interplay between prior
model probabilities $f(m)$ and BIC and other information
criteria in a time series context when Jeffreys
priors are used for model parameters.
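
As a rough illustration of (9) and of the penalty implied by (10), the sketch below evaluates the number of units of information from two determinants and compares the implied per-parameter penalty with the BIC penalty log n. The helper names and all numerical values are hypothetical; determinants are passed as scalars so that only the standard library is needed.

```python
import math

def units_of_information(det_V, det_i, d):
    """c_m^{-2} from (9): (|V_m| |i(beta_m)|)^{-1/d_m}."""
    return (det_V * det_i) ** (-1.0 / d)

def complexity_penalty(n, c, d, log_prior_m):
    """Per-parameter penalty implied by (10): log n + log c^2 - (2/d) log f(m)."""
    return math.log(n) + math.log(c * c) - 2.0 * log_prior_m / d

# Hypothetical model: d = 2, unit information i = diag(1, 4), prior variance V = 25 * i^{-1}.
d, det_i = 2, 1.0 * 4.0
det_V = (25.0 ** d) / det_i
c = units_of_information(det_V, det_i, d) ** -0.5   # recovers the scale c = 5
log_prior_m = math.log(0.5)                         # two models, equal prior probability
for n in (10**2, 10**4, 10**8):
    print(n, complexity_penalty(n, c, d, log_prior_m) / math.log(n))
```

The printed ratio approaches 1 as n grows, but only at a logarithmic rate, so for moderate n a large value of c can still add appreciably to the penalty.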

It is clear from (10) that a large value of $c_m$ arising
from a diffuse prior penalizes more complex models.
On the other hand, a more moderate value of $c_m$
(such as unit information) may have the effect of 
shrinking the posterior distributions of the model 
parameters toward the prior mean to a greater ex- 
tent than desired. This has a particular impact when 
model averaging is used to provide predictive infer- 
ences (see, e.g., Hoeting et al., 1999), where both the 
posterior model probabilities and the posterior dis- 
tributions of the model parameters are important. 
A conflict can arise where to achieve the amount of 
dispersion desired in the prior distribution for model 
parameters, more complex models are unfairly pe- 
nalized. To avoid this, we suggest choosing the dis- 
persion of the prior distributions of model parame- 
ters to provide the amount of shrinkage to the prior 
mean which is considered appropriate a priori, and 
to choose prior model probabilities to adjust for the 
resulting effect this will have on the posterior model 
probabilities. We propose 

$$f(m) \propto p(m)\, c_m^{d_m}, \tag{11}$$

where $p(m)$ are baseline model probabilities. The purpose
of decomposing prior model probabilities $f(m)$
in this way is to explicitly specify a direct depen- 
dence between these probabilities and the hyperpa- 
rameters of the prior distributions for the parame- 
ters of each model. There is no requirement that p(m) 
be uniform, and any differences between f(m) for 
different m which are unrelated to the prior dis- 
tributions for the model parameters are absorbed 
in p(m). Often, we might expect p(m) not to depend 
on the dimensionalities of the models, although we 
do not prohibit this. With this choice of f(m), (8) 
becomes 

$$\log f(m \mid y) = C + \log f(y \mid m, \hat\beta_m) - \frac{1}{2 c_m^2} (\hat\beta_m - \mu_m)^T \Sigma_m^{-1} (\hat\beta_m - \mu_m) + \log p(m) - \frac{d_m}{2} \log n + O_p(n^{-1/2}). \tag{12}$$

Where the specification of the base variance $\Sigma_m$ is
not in terms of unit information, the extra term
$-\log(|\Sigma_m| \cdot |i(\beta_m)|)/2$ is required in (12). When $c_m$
is large and when all $p(m)$ are equal, model comparison
is asymptotically based on BIC. More generally,
we propose choosing prior model probabilities based
on (11) for any prior variance $V_m$. Substituting (9)
into (11), we obtain

$$f(m) \propto p(m) \left( |V_m|\, |i(\beta_m)| \right)^{1/2}. \tag{13}$$
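
A minimal sketch of how prior model probabilities of the form (11) might be computed is given below. The function name and the example model space are assumptions made for illustration; the weights are normalized on the log scale because $c_m^{d_m}$ can be very large for diffuse priors.

```python
import math

def prior_model_probs(baseline, c, d):
    """Normalized f(m) proportional to p(m) * c_m^{d_m}, computed on the log scale."""
    log_w = [math.log(p) + dm * math.log(cm) for p, cm, dm in zip(baseline, c, d)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical model space: dimensions 1, 2, 3, common scale c = 10, uniform baseline p(m).
probs = prior_model_probs([1 / 3] * 3, [10.0] * 3, [1, 2, 3])
print(probs)
```

Each additional parameter multiplies the prior weight by c, which is the factor needed to offset the term $c_m^{-d_m}$ in (5).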

The choice of p(m) can be based on the form of 
the equivalent model complexity penalty which is 
deemed to be appropriate a priori. Setting all p(m) 
equal, which we propose as the default option, leads 
to model determination based on a modified BIC 
criterion involving penalized maximum likelihood. 
Hence, the impact of the prior distribution on the
posterior model probability through
$(\hat\beta_m - \mu_m)^T \Sigma_m^{-1} (\hat\beta_m - \mu_m) / 2 c_m^2$ in (12) is straightforward
to assess, and any undesirable side effects of large
prior variances are eliminated. In Section 1, we discussed
existing approaches for specifying nonuniform
$f(m)$ based on considerations such as the desire
to control model size. These can easily be incorporated
into the specification of nonuniform $p(m)$,
if desired. Other possible approaches to specifying 
or eliciting p(m) are discussed in Sections 4 and 5. 

In order to specify prior model probabilities using
(11), with $p(m)$ chosen to correspond to a particular
complexity penalty, it is necessary to be able
to evaluate $c_m^{-2}$, the number of units of information
implied by the specified prior variance $V_m$ for $\beta_m$.
Equivalently, as $f(m) \propto p(m) |V_m|^{1/2} |i(\beta_m)|^{1/2}$,
knowledge of $|i(\beta_m)|$ is required. Except in certain
circumstances, such as normal linear models, this
quantity depends on the unknown model parameters
$\beta_m$. This is not an appropriate specification
for the marginal prior distribution over model space.
One possibility is to use a sample-based estimate
$|i(\hat\beta_m)|$ to determine the "prior" model probability,
in which case the approach is not fully Bayesian. Alternatively,
as suggested above, substituting $\mu_m$, the
prior mean of $\beta_m$, into $i(\beta_m)$ gives a prior for model
comparison which has a unit information interpretation
but for which model comparison is not asymptotically
based on (12), the extra term $\log(|i(\mu_m)| /
|i(\hat\beta_m)|)/2$ being required.

3. NORMAL LINEAR MODELS 

Here we consider normal linear models where, for
$m \in M$, $y \sim N(X_m \beta_m, \sigma^2 I)$ with the conjugate prior
specification

$$\beta_m \mid \sigma^2, m \sim N(\mu_m, \sigma^2 V_m) \quad \text{and} \quad \sigma^{-2} \sim \text{Gamma}(a, \lambda). \tag{14}$$

For such models the posterior model probabilities
can be calculated exactly. Dropping the model subscript
m for clarity,

$$f(m \mid y) \propto f(m) \frac{|V^*|^{1/2}}{|V|^{1/2}} \left( 2\lambda + y^T y + \mu^T V^{-1} \mu - \tilde\beta^T V^{*-1} \tilde\beta \right)^{-(a + n/2)},$$

where $V^* = (V^{-1} + X^T X)^{-1}$ and $\tilde\beta = V^* (V^{-1} \mu + X^T y)$
is the posterior mean. Hence, setting $V = c^2 \Sigma$,
as before,

$$\begin{aligned}
\log f(m \mid y) &= C + \log f(m) - \tfrac{1}{2} \log|c^{-2} \Sigma^{-1} + X^T X| - \tfrac{1}{2} \log|\Sigma| - d \log c \\
&\quad - (a + n/2) \log\left( 2\lambda + y^T y + \mu^T V^{-1} \mu - \tilde\beta^T V^{*-1} \tilde\beta \right)
\end{aligned} \tag{15}$$

$$\begin{aligned}
\phantom{\log f(m \mid y)} &= C - (a + n/2) \log\left( 2\lambda + (y - X\tilde\beta)^T (y - X\tilde\beta) + (\tilde\beta - \mu)^T V^{-1} (\tilde\beta - \mu) \right) \\
&\quad + \log f(m) - \tfrac{1}{2} \log|i| - \frac{d}{2} \log n - \tfrac{1}{2} \log|\Sigma| - d \log c + O(n^{-1}),
\end{aligned} \tag{16}$$

where, with a slight abuse of notation, $i = n^{-1} X^T X$
is the unit information matrix multiplied by $\sigma^2$. Notice
the correspondence between (7) and (16). As
before, if $|\Sigma| = |i|^{-1}$, then $c^{-2}$ can be interpreted as
the number of units of information in the prior (as
the prior variance is $c^2 \sigma^2 \Sigma$) and

$$\begin{aligned}
\log f(m \mid y) &= C - (a + n/2) \log\left( 2\lambda + (y - X\tilde\beta)^T (y - X\tilde\beta) + (\tilde\beta - \mu)^T V^{-1} (\tilde\beta - \mu) \right) \\
&\quad + \log f(m) - \frac{d}{2} \log n - d \log c + O(n^{-1}).
\end{aligned} \tag{17}$$

In both (16) and (17) the posterior mean $\tilde\beta$ can be
replaced by the least squares estimator $\hat\beta$. Again,
if c = 1 (unit information) and the prior distribution
across models is uniform, model comparison is
performed using a modified version of BIC, as presented
for example by Raftery (1995), where n/2
times the logarithm of the residual sum of squares
for the model has been replaced by the first term
on the right-hand side of (17). The residual sum
of squares is evaluated at the posterior mode, and
is penalized by a term representing deviation from
the prior mean, as in (7). This expression also depends
on the prior for $\sigma^2$ through the prior parameters
a and $\lambda$, although these terms vanish when the
improper prior $f(\sigma^2) \propto \sigma^{-2}$, for which $a = \lambda = 0$,
is used. With these values, and setting $\Sigma^{-1} = i =
n^{-1} X^T X$, we obtain the prior used by Fernandez,
Ley and Steel (2001), who also noted the unit information
interpretation when c = 1 for all m. This is
an example of a g-prior (Zellner, 1986).

As before, if the prior variance V suggests a different
value of c, then the resulting impact on the
posterior model probabilities can be moderated by
an appropriate choice of $f(m)$ and again we propose
the use of (11) and (13), noting that for normal models
i is known. In the context of normal linear models,
Pericchi (1984) suggested a similar adjustment
of prior model probabilities by an amount related
to the expected gain in information. Alternatively,
replacing $|i|$ by $|i + n^{-1} V^{-1}|$ in (13), resulting in

$$f(m) \propto p(m) |V|^{1/2} |i + n^{-1} V^{-1}|^{1/2}, \tag{18}$$

makes (16) exact, eliminating the $O(n^{-1})$ term. Again,
for highly diffuse prior distributions on the model
parameters (large values of $c^2$), together with $a =
\lambda = 0$ and prior model probabilities based on (11)
and (13), equation (17) implies that model comparison
is performed on the basis of BIC.
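
The exactness claimed for (18) can be verified in two lines. The following worked step uses only the definitions $V^* = (V^{-1} + X^T X)^{-1}$ and $i = n^{-1} X^T X$ from this section, with d the dimension of the model; it is offered as a check rather than as part of the original argument.

```latex
\tfrac{1}{2}\log|V^*| = -\tfrac{1}{2}\log\bigl|V^{-1} + X^T X\bigr|
  = -\tfrac{d}{2}\log n - \tfrac{1}{2}\log\bigl|i + n^{-1}V^{-1}\bigr| ,
\qquad
\log f(m) = \text{const} + \log p(m) + \tfrac{1}{2}\log|V|
  + \tfrac{1}{2}\log\bigl|i + n^{-1}V^{-1}\bigr| ,

\log f(m) + \tfrac{1}{2}\log|V^*| - \tfrac{1}{2}\log|V|
  = \text{const} + \log p(m) - \tfrac{d}{2}\log n .
```

The determinant terms cancel identically, so under (18) the only model-dependent terms outside the residual term are $\log p(m) - (d/2) \log n$, with no remainder.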



We note that when the g-prior $\Sigma^{-1} = n^{-1} X^T X$
is used, together with $\mu = 0$, then the posterior model
probability (15) can be written as

$$\log f(m \mid y) = C + \log f(m) - \frac{d}{2} \log(n + c^{-2}) - d \log c - (a + n/2) \log\left\{ 2\lambda + \frac{1}{1 + n c^2} y^T y + \frac{n c^2}{1 + n c^2} S^2 (1 - R^2) \right\}, \tag{19}$$

where $S^2 = \sum_{i=1}^n (y_i - \bar y)^2$ and $R^2$ is the standard
coefficient of determination for the model. For our
prior, where $f(m) \propto p(m) c^d$, we obtain

$$\log f(m \mid y) = C + \log p(m) - \frac{d}{2} \log(n + c^{-2}) - (a + n/2) \log\left\{ 2\lambda + \frac{1}{1 + n c^2} y^T y + \frac{n c^2}{1 + n c^2} S^2 (1 - R^2) \right\}.$$

The trade-off between model fit, as reflected by $R^2$,
and complexity, measured by d, is immediately apparent,
with the complexity penalty tending to BIC
as $c^{-2}$ tends to zero. The posterior model probability
(19) is similar to expression (5) of Liang et al.
(2008). Their approach differs in that they consider 
the intercept parameter of the linear model sepa- 
rately, giving it an improper uniform prior, as this 
parameter is common to all models under consid- 
eration. Such a specification might also be adopted 
within our framework, both for linear models and 
for more general regression models. 
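
Expression (19) needs only a few data summaries, so the behavior under the two choices of $f(m)$ is easy to explore numerically. The sketch below is illustrative: the function name, its argument names and the data summaries are assumptions, and the improper choice $a = \lambda = 0$ is used as the default.

```python
import math

def log_post_gprior(n, d, r2, s2, yty, c, a=0.0, lam=0.0,
                    log_baseline=0.0, adjust_prior=True):
    """Unnormalized log f(m|y) under the g-prior with mu = 0, following (19).

    adjust_prior=True uses f(m) proportional to p(m) c^d, which cancels -d log c.
    adjust_prior=False uses f(m) proportional to p(m) alone.
    """
    shrink = 1.0 / (1.0 + n * c * c)                       # 1 / (1 + n c^2)
    quad = 2.0 * lam + shrink * yty + (1.0 - shrink) * s2 * (1.0 - r2)
    value = log_baseline - 0.5 * d * math.log(n + c ** -2) - (a + 0.5 * n) * math.log(quad)
    if not adjust_prior:
        value -= d * math.log(c)
    return value

n, s2, yty = 50, 100.0, 500.0          # hypothetical data summaries
for c in (1.0, 1e2, 1e4):
    # Log posterior odds of a model with d = 2, R^2 = 0.2 against the intercept-only model.
    adj = log_post_gprior(n, 2, 0.2, s2, yty, c) - log_post_gprior(n, 1, 0.0, s2, yty, c)
    uni = (log_post_gprior(n, 2, 0.2, s2, yty, c, adjust_prior=False)
           - log_post_gprior(n, 1, 0.0, s2, yty, c, adjust_prior=False))
    print(c, adj, uni)
```

For large c the adjusted log odds approach the BIC-type value $-\tfrac{1}{2} \log n - \tfrac{n}{2} \log(1 - R^2)$, while the unadjusted log odds keep falling by log c.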

4. SPECIFICATION OF p(m) BASED ON 
RELATIONSHIP WITH OTHER 
INFORMATION CRITERIA 

In Sections 2 and 3, we have investigated how prior
model probabilities might be specified by considering
their joint impact, together with the prior distributions
for the model parameters, on the posterior
model probabilities. It was shown that making these
probabilities depend on the prior variance of the associated
model parameters using (11) or (13) with
uniform $p(m)$ leads to posterior model probabilities
which are asymptotically equivalent (to order $n^{-1/2}$)
to those implied by BIC. For models other than
normal linear regression models, a prior value of $\beta$
must be substituted into (13) and so the approximation
only attains this accuracy for $\beta$ within an
$O(n^{-1/2})$ neighborhood of this value. Nevertheless,
we might expect BIC to more accurately reflect the
full Bayesian analysis for such a prior than more generally,
where the error of BIC as an approximation
to the log-Bayes factor is O(1).

Alternative (nonuniform) specifications for $p(m)$
might be based on matching the posterior model
probabilities (8) using prior weights (13) with other
information criteria of the form

$$\log f(y \mid m, \hat\beta_m) - \tfrac{1}{2} \psi(n) d_m,$$

where $\psi(n)$ is a "penalty" function; for BIC, $\psi(n) =
\log n$ and for AIC, $\psi(n) = 2$. From (12), for large $c_m$
or for a modified criterion, we have $\psi(n) = \log n -
2 d_m^{-1} \log p(m)$. As $p(m)$ contributes to the prior model
probability through (11) it cannot be a function of n
since our prior belief on models should not change as
the sample size changes. Therefore, strictly, the only
penalty functions which can be equivalent to setting
prior model probabilities as in (11) are of the form
$\psi(n) = \log n + \psi_0$ for some positive constant $\psi_0 > 0$.
Any alternative dependence on n would correspond
to a prior which depended on n, through $f(m)$ or
$f(\beta_m \mid m)$. Hence AIC, for example, is prohibited
(as would be expected since AIC is not consistent),
whereas any approach arising from a proper prior
must be consistent. Nevertheless, if a penalty function
of a particular form is desired for a sample of
a specified size $n_0$, then setting $\log p(m) = \tfrac{1}{2} d_m \{\log n_0 -
\psi(n_0)\}$ will ensure that posterior model probabilities
are calculated on the basis of the information
criterion with penalty $\psi(n_0)$, at the relevant sample
size $n_0$.
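
The calibration at a fixed sample size can be sketched as follows. The function names and the numerical values are hypothetical; the criterion is the large-$c_m$ form of (12) with the constant C dropped, and the AIC-type penalty of 2 is used as the target.

```python
import math

def log_baseline(d, n0, psi_n0):
    """log p(m) = (d_m / 2) * (log n0 - psi(n0)), up to a constant common to all models."""
    return 0.5 * d * (math.log(n0) - psi_n0)

def criterion(loglik, d, n, log_p):
    """Large-c form of (12) without the constant C: loglik + log p(m) - (d/2) log n."""
    return loglik + log_p - 0.5 * d * math.log(n)

n0 = 200                                  # hypothetical sample size used for calibration
loglik, d = -310.4, 3                     # hypothetical maximized log-likelihood and dimension
value = criterion(loglik, d, n0, log_baseline(d, n0, psi_n0=2.0))
print(value, loglik - d)                  # the two agree at n = n0 (AIC-type penalty)
```

At any sample size other than $n_0$ the same $p(m)$ no longer reproduces the AIC-type penalty, which reflects the point that a prior that does not depend on n can match such a penalty only at one sample size.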

Clyde (2000) proposed CIC, a calibrated information
criterion, based on a joint specification of (improper)
uniform prior distributions for model parameters,
together with prior model probabilities

$$f(\beta_m \mid m) f(m) \propto (2\pi)^{-d_m/2} |n\, i(\hat\beta_m)|^{1/2} c^{-d_m/2},$$

where c is a constant which is determined by constraining
the posterior model probabilities to be the
same as those which would arise from an alternative
information criterion, such as BIC. For our prior, in
the limit as $c_m^{-2} \to 0$, we have

$$f(\beta_m \mid m) f(m) \propto (2\pi)^{-d_m/2} |\Sigma_m|^{-1/2} p(m),$$

so in the case where $|\Sigma_m| = |i(\beta_m)|^{-1}$ for a value of $\beta_m$
close to the m.l.e. these approaches will yield similar
results if $p(m)$ is calibrated to $(n/c)^{d_m/2}$, which is
plausible if $c \propto n$. Note also that, if $p(m) \propto
(2\pi)^{d_m/2} |\Sigma_m|^{1/2}$, our prior in this limiting case reduces
to a uniform measure over the "parameter
space" for $(m, \beta_m)$.

5. ALTERNATIVE ARGUMENTS FOR $f(m) \propto c_m^{d_m}$

The purpose of the following discussion is not to 
advocate a particular prior, but simply to illustrate 
that one can arrive at (11) by direct consideration of 
prior probabilities, or prior densities, or by the be- 
havior of posterior means, as well as by the asymp- 
totic behavior of posterior model probabilities, or 
associated numerical approximations, as earlier. 

5.1 Constant Probability in a Neighborhood of 
the Prior Mean 

Specifying the prior distribution on the basis of
how it is likely to impact the posterior distribution
is entirely valid, but may perhaps seem unnatural.
In particular, the consequence that the prior
model probabilities might depend on the prior distributions
for the model parameters may seem somewhat
alien. This is particularly true of the implication
of (13), that models where we have more information
(smaller dispersion) in the prior distribution
should be given lower prior probabilities than models
for which we are less certain about the parameter
values. One justification for this is to examine
the prior model probabilities for particular subsets
of the parameter spaces within models. This can be
considered as an extension of the approach of Robert
(1993) for two normal models. We consider the prior
probability of the event

$$E = \{\text{model } m \text{ is 'true'}\} \cap \{(\beta_m - \mu_m)^T i(\beta_m^0) (\beta_m - \mu_m) < \varepsilon^2\}$$

for some reference parameter value $\beta_m^0$, possibly the
prior mean $\mu_m$. The dependence of this subset of
the parameter space on the unit information at $\beta_m^0$
enforces some degree of comparability across models.
This is particularly true if the various values
of $\beta_m^0$ are compatible (e.g., they imply the same
linear predictor in a generalized linear model, as
they would generally do if set equal to 0). For the
purposes of the current discussion, we also require
$V_m = c_m^2 i(\beta_m^0)^{-1}$. This is a plausible default choice,
but nevertheless represents considerable restriction
on the structure of the prior variance, which was
previously unconstrained. Then

$$P(E) = f(m) P\left( \chi^2_{d_m} < \frac{\varepsilon^2}{c_m^2} \right) \approx \frac{f(m)\, \varepsilon^{d_m}}{2^{d_m/2 - 1}\, d_m\, \Gamma(d_m/2)\, c_m^{d_m}}$$

for small $\varepsilon$. Therefore, for this prior, if the joint
prior probability of model m in conjunction with $\beta_m$
being in some specified neighborhood (defined according
to a unit information inner product) of its
prior mean is to be uniform across models, then
we require $f(m) \propto p(m) c_m^{d_m}$ as in (11), with $p(m) =
2^{d_m/2 - 1} d_m \Gamma(d_m/2) / \varepsilon^{d_m}$.
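
The small-ball approximation to the chi-squared probability can be checked numerically. In the sketch below, the leading-order term $x^{d/2} / \{2^{d/2 - 1}\, d\, \Gamma(d/2)\}$ with $x = \varepsilon^2 / c^2$ is compared with the chi-squared distribution function, which is computed from the standard power series for the regularized lower incomplete gamma function. The function names and test values are illustrative assumptions.

```python
import math

def chi2_cdf(x, d, terms=200):
    """P(chi^2_d < x) via the series for the regularized lower incomplete gamma function."""
    a, z = 0.5 * d, 0.5 * x
    if z <= 0.0:
        return 0.0
    total = 0.0
    for k in range(terms):
        # Term z^(a+k) e^(-z) / Gamma(a+k+1), evaluated on the log scale.
        total += math.exp((a + k) * math.log(z) - z - math.lgamma(a + k + 1.0))
    return total

def small_ball_approx(eps, c, d):
    """Leading-order value of P(chi^2_d < eps^2 / c^2) for small eps / c."""
    return (eps / c) ** d / (2.0 ** (0.5 * d - 1.0) * d * math.gamma(0.5 * d))

eps, c = 0.1, 5.0   # hypothetical neighborhood radius and prior scale
for d in (1, 2, 5):
    print(d, chi2_cdf((eps / c) ** 2, d), small_ball_approx(eps, c, d))
```

For each dimension the two printed values agree closely, and both scale as $c^{-d}$, which is the dependence that $f(m) \propto c_m^{d_m}$ is designed to offset.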

5.2 Flattening Prior Densities 

An alternative justification of (11) when the model
parameters are given diffuse normal prior distributions
arises as follows. One way of taking a "baseline"
prior distribution and making it more diffuse,
to represent greater prior uncertainty, is to raise the
prior density to the power $1/c^2$ for some $c^2 > 1$, and
then renormalize. For example, for a single normal
distribution this has the effect of multiplying the
variance by $c^2$, which increases the prior dispersion
in an obvious way. Highly diffuse priors, suitable
in the absence of strong prior information, may be
thought of as arising from a baseline prior transformed
in this way for some large value of $c^2$. Where
model uncertainty exists, the joint prior distribution
is a mixture whose components correspond to the
models, with mixture weights $f(m)$. As suggested
above, a diffuse prior distribution might be obtained
by raising a baseline prior density (with respect to
the natural measure over models and associated
parameter spaces) to the power $1/c^2$ and renormalizing.
Where the baseline prior distribution for $\beta_m$ is
normal with mean $\mu_m$ and variance $\Sigma_m$, the effect of
raising the mixture prior density to the power $1/c^2$
is to increase the variance of each $\beta_m$ by a factor
of $c^2$, as before. For large values of $c^2$ the effect of the
subsequent renormalization is that the model
probabilities are proportional to $|\Sigma_m|^{1/2}(2\pi)^{d_m/2}c^{d_m}$,
independent of the model probabilities in the original
baseline mixture prior. Again this illustrates a
relationship between prior model probabilities and prior
dispersion parameters satisfying (11). For the two
normal models considered by Robert (1993) the
resulting prior model probabilities are identical. Where
the baseline variance is based on unit information, so
$|\Sigma_m| = |i(\beta_m)|^{-1}$, then the prior model probabilities can
be written as (13) with $p(m) = (2\pi)^{d_m/2}|i(\beta_m)|^{-1/2}$.
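The flattening argument can be illustrated with a short calculation. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and each model is summarized only by its baseline probability, its dimension $d_m$ and $\log|\Sigma_m|$. It integrates each powered normal component in closed form, which gives a weight $f(m)^{1/c^2}\{(2\pi)^{d_m/2}|\Sigma_m|^{1/2}\}^{1-1/c^2}c^{d_m}$, and compares the renormalized weights with the large-$c^2$ limit quoted in the text.

```python
import math


def _normalize(log_weights):
    """Turn log-weights into probabilities in a numerically stable way."""
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def flattened_model_probs(base_probs, dims, log_det_sigma, c2):
    """Model probabilities after raising a mixture of normal priors to the
    power 1/c2 and renormalizing (each component integrated exactly)."""
    keep = 1.0 - 1.0 / c2
    log_w = []
    for f_m, d_m, ld in zip(base_probs, dims, log_det_sigma):
        log_w.append(math.log(f_m) / c2
                     + 0.5 * d_m * keep * math.log(2.0 * math.pi)
                     + 0.5 * keep * ld
                     + 0.5 * d_m * math.log(c2))
    return _normalize(log_w)


def limiting_model_probs(dims, log_det_sigma, c2):
    """Large-c2 limit: weights |Sigma_m|^{1/2} (2 pi)^{d_m/2} c^{d_m}."""
    log_w = [0.5 * ld + 0.5 * d_m * math.log(2.0 * math.pi) + 0.5 * d_m * math.log(c2)
             for d_m, ld in zip(dims, log_det_sigma)]
    return _normalize(log_w)


if __name__ == "__main__":
    dims, log_dets = (1, 2), (0.0, 0.0)  # two hypothetical models
    for base in ((0.9, 0.1), (0.1, 0.9)):
        print(base, flattened_model_probs(base, dims, log_dets, c2=1e6))
    print(limiting_model_probs(dims, log_dets, c2=1e6))
```

With $c^2 = 1$ the function returns the baseline probabilities unchanged, while for large $c^2$ two very different baselines give practically the same flattened model probabilities.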

5.3 Bayesian Model Averaging and Shrinkage 

Finally, this approach can be justified by con- 
sidering the behavior of the posterior mean under 
model averaging. We restrict consideration here to 
two nested models, $m_0$ and $m_1$, differing by a single
parameter $\beta$, and suppose that $f(y|m_0) = f(y|m_1, \beta_0)$.
We assume that the prior for $\beta$ under $m_1$ is $N(\beta_0, c^2)$,
so the prior mean under model $m_1$ is the specified
value of $\beta$ under model $m_0$, and, without loss of
generality, we take $\beta_0 = 0$. Under model $m_1$ the Bayes
estimator for $\beta$ is the posterior mean $E_1(\beta|y)$, which
has asymptotic expansion

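$$E_1(\beta|y) = \hat\beta\left(1 - \frac{c^{-2}}{n\,i(\hat\beta)}\right) + \frac{a_3}{2n\,i(\hat\beta)^2} + o(n^{-1}), \qquad (20)$$

where $na_3$ is the third derivative of the log-likelihood,
evaluated at $\hat\beta$ (see, e.g., Johnson, 1970; Ghosh, 1994).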
This illustrates the usual effect of prior variance $c^2$
and the corresponding prior precision $c^{-2}$ as a shrinkage
parameter, with the posterior mean being shrunk
away from the m.l.e., with the amount of shrinkage
diminishing as $c^{-2} \to 0$. Hence, for fixed $y$, the
posterior mean for $\beta$ is (asymptotically) monotonic
in $c^{-2}$. Allowing for model uncertainty, we have
$E(\beta|y) = f(m_1|y)E_1(\beta|y)$ where

$$f(m_1|y) = \frac{1}{1 + k(2\pi)^{1/2}c\,f_1(0|y)}, \qquad (21)$$

Fig. 1. Model average coefficient on $\hat\beta$ [evaluated as $E(\beta|y)/\hat\beta$],
for normal likelihood with known error variance, $\sigma^2$. The
plot here is for $n = 10$, $\hat\beta = 1$, $\sigma^2 = 1$. The dashed line is for
a uniform prior over models, and the solid line uses the adjusted
prior over models, with prior odds $k \propto c^{-1}$. The dotted lines are
approximations based on replacing $(2\pi)^{1/2}f_1(0|y)$ in (21) with its
normal approximation, ignoring the dependence, to $O(n^{-1})$,
of $f_1(0|y)$ on $c^{-2}$.

where $f_1(\beta|y)$ is the posterior (marginal) density
for $\beta$ under $m_1$, and $k$ is the prior odds in favor
of $m_0$ over $m_1$. Combining (20) and (21), we see
that, in $E(\beta|y)$, the model-averaged posterior mean
for $\beta$, the m.l.e. $\hat\beta$ is multiplied by a shrinkage
coefficient, $f(m_1|y)E_1(\beta|y)/\hat\beta$, which is not a monotonic
function of the prior precision for $\beta$, and hence $c^{-2}$ no
longer has a simple interpretation as a shrinkage
parameter. A simple illustration of this is provided by
Figure 1, where this coefficient is plotted for various
values of $c^{-2}$, for the simple example of a normal
distribution with known error variance, and prior odds
$k = 1$, corresponding to a uniform prior on model
space. Note that a high value of the coefficient on $\hat\beta$
corresponds to low shrinkage. It can be seen that,
regardless of the value of $c^{-2}$, there is a certain amount
of shrinkage toward the prior mean and the shrinkage
is not a monotone function of $c^{-2}$. For values
of $c^{-2}$ greater than 0.5, the shrinkage to the prior
mean is an approximately linearly increasing function
of $c^{-2}$, as expected. For small values of $c^{-2}$,
posterior probability is increasingly concentrated on $m_0$
as $c^{-2}$ decreases (Lindley's paradox) and hence the
model-averaged estimate is increasingly shrunk to
the prior mean. Adopting the approach advocated
in this paper has the effect of setting $k \propto c^{-1}$, which
mitigates this effect and returns control over the
shrinkage to the analyst.
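The behavior described above can be reproduced for the normal example with known variance. The following is a minimal sketch under stated assumptions, not the code used for Figure 1: function names are hypothetical, the exact normal posterior under $m_1$ is used for $f_1(0|y)$, and the proportionality constant in $k \propto c^{-1}$ is taken to be 1.

```python
import math


def shrinkage_coefficient(prior_precision, k, n=10, beta_hat=1.0, sigma2=1.0):
    """Model-averaged coefficient E(beta|y)/beta_hat for prior precision c^{-2}
    and prior odds k in favor of m0 (normal data, known variance)."""
    c = prior_precision ** -0.5
    like_precision = n / sigma2
    post_precision = like_precision + prior_precision
    post_mean = beta_hat * like_precision / post_precision
    # Posterior density of beta under m1, evaluated at beta = 0.
    f1_at_zero = (math.sqrt(post_precision / (2.0 * math.pi))
                  * math.exp(-0.5 * post_precision * post_mean ** 2))
    prob_m1 = 1.0 / (1.0 + k * math.sqrt(2.0 * math.pi) * c * f1_at_zero)  # as in (21)
    return prob_m1 * post_mean / beta_hat


def uniform_coefficient(prior_precision):
    """Uniform prior on model space, k = 1."""
    return shrinkage_coefficient(prior_precision, k=1.0)


def adjusted_coefficient(prior_precision):
    """Adjusted prior on model space, k = 1/c (constant of 1 is an assumption)."""
    return shrinkage_coefficient(prior_precision, k=math.sqrt(prior_precision))


if __name__ == "__main__":
    for prec in (1e-6, 1e-4, 1e-2, 0.1, 1.0, 10.0):
        print(prec, round(uniform_coefficient(prec), 3), round(adjusted_coefficient(prec), 3))
```

Under the uniform prior the coefficient first rises and then falls as $c^{-2}$ increases, whereas with $k = c^{-1}$ it decreases monotonically in $c^{-2}$, so the prior precision again acts as an ordinary shrinkage parameter.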

6. ILLUSTRATED EXAMPLES 

We illustrate our approach in a series of simula- 
tions and real data applications. For comparison, we 
also present results under other prior specifications, 
notably the hyper-g prior of Liang et al. (2008),
for which computation is performed using the BAS 
package; see Clyde (2010). 

Section 6.1 illustrates that unit information prior 
specifications (or other specifications suggesting 
smaller prior parameter dispersion) can indeed sig- 
nificantly shrink posterior distributions toward zero. 
This effect suggests that although prior variances 
based on unit information might have desirable be- 
havior with respect to model determination, they 
may unintentionally distort the parameter posterior 
distributions. We demonstrate that this can affect 
the predictive ability of routinely used model aver- 
aging approaches in which information is borrowed 
across a set of models. 

In Section 6.2 we illustrate the effect of Lindley's 
paradox in a standard linear regression context em- 
phasizing its dramatic effect on inference concerning 
model uncertainty. At the same time, we demon- 
strate that if instead of using the standard discrete 
uniform prior distribution for f(m) we adopt our 
proposed adjusted prior distribution given by (11) 
with p(m) = 1, the prior distribution for the model 
parameters can be made highly diffuse in a way 
which does not impact strongly on the posterior 
model probabilities. 

Finally, Section 6.3 investigates the behavior of 
posterior model probabilities when substantive prior 
information about the parameters is available. We 
demonstrate through a real data example that the 
uniform prior on models may have a significant im- 
pact on posterior model probabilities and we illus- 
trate the advantages of specifying prior model prob- 
abilities that are appropriately adjusted for param- 
eter prior dispersions. 

6.1 Example 1: A Simple Linear Regression 
Example 

Montgomery, Peck and Vining (2001) investigated 
the effect of the logarithm of wind velocity (x), mea- 
sured in miles per hour, on the production of elec- 
tricity from a water mill (y), measured in volts, via 
a linear regression model of the form 

$$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \qquad i = 1, \ldots, n,$$

based on $n = 25$ data points. We calculate the posterior
odds of the above model, denoted by $m_1$,
against the constant model denoted by $m_0$, adopting
the usual conjugate prior specification given by (14)
with zero mean, variance $V_m = c_m^2 n(X_m^T X_m)^{-1}$ and
$a = \lambda = 10^{-2}$. Since there is a high sample correlation
coefficient of 0.978 between $y$ and $x$, we expect
that $m_1$ will be a posteriori strongly preferred
to $m_0$. Indeed, the posterior probability of $m_1$ is
very close to 1 for values of $c_m^2$ as large as $10^{28}$.
This behavior provides a source of security with respect
to the choice of $c_m^2$ and Lindley's paradox,
and we use this example to investigate the effect
of $c_m^2$ on the posterior densities of $\beta_0$ and $\beta_1$; see
Figure 2. We have used values of $c_m^2$ that represent
highly diffuse priors with $c_m^2 = 10$ and $c_m^2 = 100$, the
unit information prior that approximates BIC with
$c_m^2 = 1$, a prior that approximates AIC for this sample
size, $c_m^2 = (e^2 - 1)/n = 0.256$, and a prior suggested
by the risk inflation criterion (RIC) of Foster
and George (1994) with $c_m^2 = 0.04$; see also George
and Foster (2000). It is striking that the resulting
posterior densities differ markedly in both location and
scale. The danger of misinformation when unit
information priors are used was discussed in detail by
Paciorek (2006).

Fig. 2. Posterior densities of parameters $\beta_0$ and $\beta_1$ under different prior dispersions; $c_m^2 = c^2$ for all models $m$ for Example 6.1.

We also investigated how the Zellner and Siow
(1980) prior and the Liang et al. (2008) hyper-g
prior behave in this example. With the recommended
hyperparameter values $2 < a \le 4$, these priors
produced posterior densities close to the low information
g-prior with $c_m^2 = 100$; see Figure 2. The results
are quite robust across this range for $a$ and, for example,
quite large values of $a$, around 20, are required
before the level of shrinkage becomes comparable
to the unit information g-prior. Hence inferences
arising from the hyper-g prior are quite robust across
the recommended range of hyperparameter values.

Finally, we examined the effect of intrinsic priors
on posterior distributions for model parameters. We
adopted the approach of Pérez and Berger (2002) to
construct an intrinsic (or expected posterior) prior
by setting as a baseline prior the g-prior with $c^2 = 100$
and the null model as a reference. For this simple
linear regression model the minimal training sample
has size $n^* = 3$. The resulting posterior distributions
of $\beta_0$ and $\beta_1$, also shown in Figure 2, are in
close agreement with the baseline g-prior. However,
in variable selection problems the minimal training
sample is usually set so that the full model can be
estimated. Hence, the value of $n^*$ could be much
higher if more covariates were available and this
would affect the prior variance of the parameters.
As an example, we have calculated the posterior
densities of $\beta_0$ and $\beta_1$ when $n^* = 20$, also displayed
in Figure 2. The effect of the prior densities on the
posterior distributions is dramatic. This nicely illustrates
the effect of the training sample size in intrinsic
priors; see the relevant discussion in Berger and
Pericchi (2004).

We now investigate the effect of prior specification 
when prediction is of primary interest. A common 
way of evaluating predictive performance is to com- 
pute the negative cross-validation score (see Geisser 
and Eddy, 1979) given by 

$$S = -\sum_{j=1}^{n} \log f_p(j),$$

where

$$f_p(j) = \sum_{m \in M} f(m|y_{\setminus j})\, f(y_j|y_{\setminus j}, m)$$

is the model-averaged predictive density of observation
$y_j$ given the rest of the data $y_{\setminus j}$. Lower values
of $S$ indicate greater predictive accuracy. Following
Gelfand (1996) we estimate $f_p(j)$ from an MCMC
sample by the inverse of the posterior (over $m, \beta_m$)
mean of the inverse predictive density of observation $j$.

Fig. 3. Negative cross-validation log-likelihood for two prior
dispersion structures with uniform prior (solid line) and adjusted
prior (dashed line) for Example 6.1. (a) g-prior ($V_m = c_m^2 n(X_m^T X_m)^{-1}$).
(b) Independence prior ($V_m = c_m^2 I_{d_m}$).
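The estimator just described can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names and the input layout are hypothetical, and it assumes that `pred_dens[t][j]` already holds the density of observation $y_j$ evaluated at the $t$-th MCMC draw of $(m, \beta_m)$ from the full-data posterior, with observations conditionally independent given the parameters.

```python
import math


def cv_predictive_densities(pred_dens):
    """Estimate f_p(j) as the inverse of the posterior mean of the inverse
    predictive density of observation j (harmonic mean over MCMC draws)."""
    n_draws = len(pred_dens)
    n_obs = len(pred_dens[0])
    estimates = []
    for j in range(n_obs):
        mean_inverse = sum(1.0 / pred_dens[t][j] for t in range(n_draws)) / n_draws
        estimates.append(1.0 / mean_inverse)
    return estimates


def negative_cv_score(pred_dens):
    """S = -sum_j log f_p(j); lower values indicate better prediction."""
    return -sum(math.log(f) for f in cv_predictive_densities(pred_dens))


if __name__ == "__main__":
    # Two hypothetical MCMC draws and two observations.
    draws = [[0.2, 0.5], [0.4, 0.5]]
    print(cv_predictive_densities(draws), negative_cv_score(draws))
```

Because the estimate is a harmonic mean, it can be dominated by a few draws with very small predictive density, so in practice a long MCMC run is advisable.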

We generated three additional covariates that have
correlation coefficients 0.99, 0.97 and 0.89 with $x$
and performed the same model determination exercise.
Posterior model probabilities were calculated
for all models under consideration.
We used a g-prior with $V_m = c_m^2 n(X_m^T X_m)^{-1}$ and
an independence prior with $V_m = c_m^2 I_{d_m}$. For the uniform
prior on models combined with the unit information
prior obtained by $c_m^2 = 1$, $S$ is far away from
the minimum value achieved for higher values of $c_m^2$;
see Figure 3(a). For $c_m^2 > 10^5$, $S$ increases due to the
effect of Lindley's paradox focusing posterior probability
on models that are unrealistically simple. On
the other hand, our proposed adjusted prior specification
achieves the maximum predictive ability for
any large value of $c_m^2$; see Figure 3(b). The same
exercise was also repeated for the hyper-g prior for
various values of the hyperparameter $a$. The corresponding
negative cross-validation score was close to
the stabilized value of the g-prior and it proved
to be very robust for a wide range of values of $a$.
Only for $a$ very close to 2 did predictive ability start
to deteriorate in a similar fashion to the g-prior.



This simulated data exercise does indicate that 
predictive ability can be optimized if highly dis- 
persed prior parameter densities are chosen together 
with the adjusted prior over model space. Alterna- 
tively, in this example, the hyper-g family is suf- 
ficiently robust to simultaneously provide a diffuse 
prior for model parameters, together with reason- 
able behavior under model uncertainty. 

6.2 Example 2: Simulated Regressions 

We now consider the first simulated dataset of 
Dellaportas, Forster and Ntzoufras (2002) based on 
$n = 50$ observations of 15 standardized independent
normal covariates $X_j$, $j = 1, \ldots, 15$, and a response
variable $Y$ generated as

$$Y \sim N(X_4 + X_5, 2.5^2). \qquad (22)$$

Assuming a conjugate normal inverse gamma prior
distribution given by (14) with zero mean, $V_m = c_m^2 \Sigma_m$
and $a = \lambda = 10^{-2}$, we calculated posterior
model probabilities for all models under consideration.
Similar behavior is exhibited either when $\Sigma_m$
is specified as $\Sigma_m = n(X_m^T X_m)^{-1}$ (described below)
or as $\Sigma_m = I_{d_m}$.

Figures 4(a) and (b) illustrate the behavior of
the posterior model probabilities, under a uniform
prior on model space, of three indicative models. For
the parameters we used the g-prior and the hyper-g
prior with $c_m^2 = 2n^{-1}/(a - 2)$ obtained by equating
the shrinkage proportion $g/(g + 1)$ of the g-prior
with its prior mean under the hyper-g prior. The effect
of Lindley's paradox is more evident for the g-prior,
where all posterior probabilities are quite sensitive
to the values of $c_m^2$, while the hyper-g prior
demonstrates a remarkable robustness for a wide
range of prior parameter values and only for quite
large values of $c_m^2$, which correspond to values of $a$
close to 2, is Lindley's paradox exhibited. We note
that the hyper-g prior seems to result in increased
uncertainty on model space, resulting in lower posterior
model probabilities for the higher posterior
probability models.

Fig. 4. Posterior model probabilities under different prior
dispersions for the Dellaportas, Forster and Ntzoufras (2002)
dataset of Section 6.2 generated using (22). (a) Zellner's g-prior
with uniform prior on model space. (b) Hyper-g prior with uniform
prior on model space. (c) Zellner's g-prior with adjusted prior on
model space. Solid line: constant model; short dashed line:
$1 + X_4 + X_5$ model; long dashed line: $1 + X_4 + X_5 + X_{12}$ model.

By contrast, using the adjusted prior in Figure 4(c)
identifies $1 + X_4 + X_5 + X_{12}$ as the highest probability
model for any value of $c_m^2 > 1$. Note that, when
$\Sigma_m = n(X_m^T X_m)^{-1}$, $c_m^2 = 1$ represents the dispersion
induced by the unit information prior. Similarly,
Figure 5 summarizes the posterior inclusion probability
of each variable $X_j$. Again, for the uniform
prior these probabilities are sensitive to changes in
$c_m^2$ across its range, whereas the adjusted prior produces
stable results for $c_m^2 > 1$.
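The contrast between the two model space priors can be illustrated with a small calculation. The sketch below rests on several assumptions and is not the computation behind Figures 4 and 5: it uses the closed-form Bayes factor against the null model given by Liang et al. (2008) for Zellner's g-prior with $g = c_m^2 n$, rather than the conjugate specification (14); the three candidate models and their $R^2$ values are hypothetical; and the adjusted prior is taken as $f(m) \propto c_m^{d_m}$ with $p(m) = 1$, where $d_m$ counts covariates other than the intercept.

```python
import math


def log_bf_vs_null(n, d, r2, g):
    """Log Bayes factor of a model with d covariates and coefficient of
    determination r2 against the intercept-only model (g-prior closed form)."""
    return 0.5 * (n - 1 - d) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2))


def posterior_model_probs(n, models, c2, adjusted):
    """Posterior model probabilities for a list of (d, r2) pairs.
    adjusted=False: uniform prior on models; adjusted=True: f(m) ∝ c^d."""
    g = c2 * n
    log_w = []
    for d, r2 in models:
        log_weight = log_bf_vs_null(n, d, r2, g)
        if adjusted:
            log_weight += 0.5 * d * math.log(c2)
        log_w.append(log_weight)
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical summaries: null model, a 2-covariate and a 3-covariate model.
    models = [(0, 0.0), (2, 0.30), (3, 0.33)]
    for c2 in (1.0, 1e3, 1e6, 1e12):
        print(c2,
              [round(p, 3) for p in posterior_model_probs(50, models, c2, adjusted=False)],
              [round(p, 3) for p in posterior_model_probs(50, models, c2, adjusted=True)])
```

Under the uniform prior the null model absorbs essentially all posterior probability as $c_m^2$ grows, whereas under the adjusted prior the posterior probabilities settle to stable values once $c_m^2$ is large.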

In a more detailed simulation study, we repeated
the above analysis by generating 100 datasets from the
same model. The distribution of the posterior model
probabilities over the 100 simulated datasets reinforces
the findings of the one-sample based simulation.
We also repeated the above simulation experiment
with a more challenging simulated dataset
based on a simulation structure suggested by Nott
and Kohn (2005). Each dataset consisted of $n = 50$
observations and $p = 15$ covariates and one response
generated using the following sampling scheme:

$$\begin{aligned}
X_j &\sim N(0, 1) && \text{for } j = 1, \ldots, 10, \\
X_j &\sim N(0.3X_1 + 0.5X_2 + 0.7X_3 + 0.9X_4 + 1.1X_5,\ 1) && \text{for } j = 11, \ldots, 15, \\
Y &\sim N(4 + 2X_1 - X_5 + 1.5X_7 + X_{11} + 0.5X_{13},\ 2.5^2).
\end{aligned} \qquad (23)$$

The general conclusions of this study are in close
agreement with the results obtained above. Further
details are available in the electronic supplement
at http://stat-athens.aueb.gr/~jbn/papers/paper24.htm.
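For readers who wish to reproduce a dataset of this form, the sampling scheme (23) can be sketched as follows. The function name, the return layout and the use of Python's standard random number generator are assumptions made for illustration; the original study may have used different software and seeds.

```python
import random


def simulate_dataset(n=50, sigma=2.5, seed=None):
    """Draw n observations (x, y) from sampling scheme (23).
    x is a list of 15 covariates; index 0 corresponds to X_1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # X_1, ..., X_10: independent standard normal covariates.
        x = [rng.gauss(0.0, 1.0) for _ in range(10)]
        # X_11, ..., X_15: correlated with X_1, ..., X_5 through a common mean.
        common_mean = 0.3 * x[0] + 0.5 * x[1] + 0.7 * x[2] + 0.9 * x[3] + 1.1 * x[4]
        x += [rng.gauss(common_mean, 1.0) for _ in range(5)]
        # Response depends on X_1, X_5, X_7, X_11 and X_13.
        mean_y = 4.0 + 2.0 * x[0] - x[4] + 1.5 * x[6] + x[10] + 0.5 * x[12]
        rows.append((x, rng.gauss(mean_y, sigma)))
    return rows


if __name__ == "__main__":
    data = simulate_dataset(seed=1)
    print(len(data), len(data[0][0]))
```

Calling `simulate_dataset` repeatedly with different seeds gives the replicated datasets needed for a simulation study of this kind.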

6.3 Example 3: A 3 x 2 x 4 Contingency Table 
Example with Available Prior Information 

We consider data presented by Knuiman and Speed 
(1988) to illustrate how our proposed methodology 
performs in an example where prior information for 
the model parameters is available. The data consist 
of 491 individuals classified in n cells by categorical 
variables obesity (O: low, average, high), hyperten- 
sion (H: yes, no) and alcohol consumption (A: 1, 
1-2, 3-5, 6+ drinks per day). We adopt the nota- 
tion of the full hierarchical log-linear model used by 
Dellaportas and Forster (1999): 



$$n_i \sim \text{Poisson}(\lambda_i) \quad \text{for } i = 1, 2, \ldots, n, \qquad \log(\lambda) = X\beta,$$

where $\lambda = (\lambda_1, \ldots, \lambda_n)^T$, $X$ is the $n \times n$ design matrix
of the full model, $\beta = (\beta_j; j \in V)$ is an $n \times 1$
parameter vector, $\beta_j$ are the model parameters that
correspond to term $j$ and $V$ is the set of all terms
under consideration. All parameters here are defined
using the sum-to-zero constraints.

Fig. 5. Posterior variable inclusion probabilities under different
prior dispersions for the Dellaportas, Forster and Ntzoufras (2002)
dataset of Section 6.2 generated using (22). (a) Zellner's g-prior
with uniform prior on model space. (b) Hyper-g prior with uniform
prior on model space. (c) Zellner's g-prior with adjusted prior on
model space.

Dellaportas and Forster (1999) proposed as a default prior for
parameters of log-linear models



(24) 



with $\mu_j$ being a vector of zeros and $k_j^2 = 2n$ for
all $j \in V = \{\emptyset, O, H, A, OH, OA, HA, OHA\}$; we denote
this prior by DF.

In their analysis, Knuiman and Speed (1988) took
into account some prior information available about
the parameters $\beta_j$. In particular, prior to this study
information was available indicating that $\beta_{OHA}$
and $\beta_{OA}$ are negligible and only $V = \{\emptyset, O, H, A, OH, HA\}$
should be considered. Moreover, the
term $\beta_{HA}$ is nonzero with a priori estimated effects
$\bar\beta_{HA}^T = (0.204, -0.088, -0.271)$; note that the signs
of the prior mean are opposite to the values
reported by Knuiman and Speed since we have
used a different ordering of the variable levels.

Knuiman and Speed adopted the prior (24) with
$\mu_{HA} = \bar\beta_{HA}$ and $\mu_j = 0$ for $j \in V \setminus \{HA\}$ and prior
variance coefficients $k_{HA}^2 = 0.05$ and $k_j^2 = \infty$ for
$j \in \{\emptyset, O, H, A, OH\}$. In our data analysis we used
$k_j^2 = 10^4$ instead of $k_j^2 = \infty$. We denote this prior as KS.
We also used a combination of the DF and KS priors,
denoted by KS/DF, modifying slightly the KS prior
so that $k_j^2 = 2n$ for terms $j \in \{\emptyset, O, H, A, OH\}$. Finally,
an additional diffuse independence prior, denoted
by IND, with zero prior mean and variance
$10^3$ for all model parameters was also used.

In log-linear models $i(\beta_m)$ depends on $\beta_m$, so to specify
the adjusted prior we utilize the prior mean $\mu_m$
of $\beta_m$, resulting in

$$f(m) \propto p(m)\,|V_m|^{1/2}\,|X_m^T \mathrm{Diag}(\lambda) X_m|^{1/2}\, n^{-d_m/2}, \qquad \lambda = \exp(X_m \mu_m),$$

while the prior parameters $p(m)$ were set equal to
$\log p(m) = -\frac{d_m}{2}\log(2)$ in line with the DF prior.
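The expression above can be evaluated directly for a small example. The following sketch is purely illustrative and rests on several assumptions: a hypothetical $2 \times 2$ table with sum-to-zero coding stands in for the $3 \times 2 \times 4$ table, the prior variance is taken to have the form $V_m = k^2 (X_m^T X_m)^{-1}$ with a single coefficient $k^2$, and the helper names are invented for this illustration. With zero prior means and $k^2 = 2n$, the choice $\log p(m) = -\frac{d_m}{2}\log(2)$ yields equal adjusted prior probabilities in this toy setting, while a more diffuse parameter prior shifts prior probability toward larger models.

```python
import math


def log_det(matrix):
    """Log-determinant of a symmetric positive definite matrix
    by Gaussian elimination without pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    total = 0.0
    for k in range(size):
        pivot = a[k][k]
        total += math.log(pivot)
        for i in range(k + 1, size):
            factor = a[i][k] / pivot
            for j in range(k, size):
                a[i][j] -= factor * a[k][j]
    return total


def weighted_gram(x, w):
    """Return X^T Diag(w) X for a design matrix x given as a list of rows."""
    d = len(x[0])
    return [[sum(w[r] * x[r][i] * x[r][j] for r in range(len(x))) for j in range(d)]
            for i in range(d)]


def adjusted_prior(x_full, models, mu, k2):
    """Adjusted prior model probabilities for models given as column index lists,
    assuming V_m = k2 (X_m^T X_m)^{-1} and log p(m) = -(d_m / 2) log 2."""
    n = len(x_full)
    log_w = []
    for cols in models:
        x_m = [[row[c] for c in cols] for row in x_full]
        d = len(cols)
        lam = [math.exp(sum(row[c] * mu[c] for c in cols)) for row in x_full]
        log_det_v = d * math.log(k2) - log_det(weighted_gram(x_m, [1.0] * n))
        log_info = log_det(weighted_gram(x_m, lam))
        log_p = -0.5 * d * math.log(2.0)
        log_w.append(log_p + 0.5 * log_det_v + 0.5 * log_info - 0.5 * d * math.log(n))
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Columns: intercept, A, B, AB (sum-to-zero coding for a 2 x 2 table).
    x_full = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]
    models = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
    zero_mean = [0.0] * 4
    print(adjusted_prior(x_full, models, zero_mean, k2=2 * len(x_full)))
    print(adjusted_prior(x_full, models, zero_mean, k2=1e4))
```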

Posterior model probabilities (estimated using reversible
jump MCMC) for all prior specifications are
presented in Table 1. The top right panel of the table
illustrates the striking effect of informative parameter
priors on posterior model probabilities. The difficulty
of making joint inferences on parameter and
model space is evident from the sensitivity
of model probabilities to different priors. However,
the specification for adjusting the prior model
probabilities has the effect that posterior model
probabilities are robust under all prior specifications.

Table 1
Prior and posterior model probabilities under different parameter and model prior densities for Example 6.3

| | Parameter prior | Model space prior | Prior: O + H + A | Prior: OH + A | Prior: O + HA | Prior: OH + HA | Posterior: O + H + A | Posterior: OH + A | Posterior: O + HA | Posterior: OH + HA |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. | DF | uniform | 0.25 | 0.25 | 0.25 | 0.25 | 0.657 | 0.336 | 0.004 | 0.002 |
| 2. | KS | uniform | 0.25 | 0.25 | 0.25 | 0.25 | 0.075 | 0.000 | 0.923 | 0.002 |
| 3. | KS/DF | uniform | 0.25 | 0.25 | 0.25 | 0.25 | 0.059 | 0.023 | 0.638 | 0.280 |
| 4. | DF | adjusted | 0.247 | 0.247 | 0.251 | 0.255744 | 0.677 | 0.317 | 0.004 | 0.002 |
| 5. | KS | adjusted | 0.046 | 0.954 | 2.0 × 10^-6 | 3.3 × 10^-5 | 0.665 | 0.335 | 0.000 | 0.000 |
| 6. | KS/DF | adjusted | 0.500 | 0.500 | 1.7 × 10^-5 | 1.7 × 10^-5 | 0.690 | 0.310 | 0.000 | 0.000 |
| 7. | IND | adjusted | 0.003 | 0.996 | 3.0 × 10^-6 | 0.001 | 0.690 | 0.303 | 0.004 | 0.003 |



7. CONCLUSION 

There are clearly alternative specifications for the 
prior model probabilities p(m) which satisfy (11), 
and we do not seek to justify one over the other. 
Indeed, choosing model probabilities to satisfy (11) 
may not be appropriate in some situations. Hence, 
we do not propose (11) as a necessary condition 
for f(m) although we do believe that there are com- 
pelling reasons for considering such a specification, 
perhaps as a default or reference position in the 
type of situations we have considered in this paper. 
What we do argue is that there is nothing sacred 
about a uniform prior distribution over models, and 
hence by implication, about the Bayes factor. It is 
completely reasonable to consider specifying $f(m)$
in a way which takes account of the prior distribu- 
tions for the model parameters for individual mod- 
els. Then, certainly within the contexts discussed 
in this paper, as demonstrated by the examples we 
have presented, the issues surrounding the role of 
the prior distribution for model parameters, in ex- 
amples with model uncertainty, become much less 
significant. 
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